Strong associations between microbe phenotypes and their network architecture 
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Understanding the dependence and interplay between architecture and function in biological net- 
works has great relevance to disease progression, biological fabrication and biological systems in 
general. We propose methods to assess the association of various microbe characteristics and phe- 
notypes with the topology of their networks. We adopt an automated approach to characterize 
metabolic networks of 32 microbial species using 11 topological metrics from complex networks. 
Clustering allows us to extract the indispensable, independent and informative metrics. Using hi- 
erarchical linear modeling, we identify relevant subgroups of these metrics and establish that they 
associate with microbial phenotypes surprisingly well. This work can serve as a stepping stone to 
cataloging biologically relevant topological properties of networks and towards better modeling of 
phenotypes. The methods we use can also be applied to networks from other disciplines. 
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A prime goal of systems biology is to discover emer- 
gent properties that may be unraveled when a systemic 
view is adopted to gain a comprehensive understanding of 
many processes that occur in biological systems. The re- 
ductionist approach which has held sway in biology over 
the past several decades has successfully identified the 
key components in living systems and many interactions 
among them. However, it almost never presents a holistic 
understanding of how the systemic properties emerge. It 
is now becoming increasingly clear that the functioning 
of biological systems depends crucially on their complex 
underlying structure This complexity is the conse- 
quence of numerous interconnected, dynamic and non- 
linear interactions among the plethora of elements, like 
genes, proteins, and metabolites. But the importance of 
biological networks lies beyond their being the most vis- 
ible signatures of complexity. Understanding the depen- 
dence and interplay between architecture and function in 
biological networks has great relevance to disease progres- 
sion, bio-fabrication and biological systems in general. 

The central issue, then, is to discover whether net- 
works encode systemic events and the precise manner in 
which they do so. Ideally, we would like to understand 
and modify the complex behavior of biological networks, 
which is contingent on the proper level of modeling of 
their molecular interactions. To model the systemic or 
emergent properties, one would have to involve critically, 
the interdependencies among interactions and other or- 
ganizational patterns, on a local level (e.g. network mo- 
tifs) as well as global level (e.g. modularity). Recent 
research in complex systems and networks has presented 
opportunities to properly mine and thence exploit the 
architectural interdependence in networks @, H,IU • 

Multiple metrics exist in complex networks and vari- 
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ous studies have utilized one or few of them at a time, 
to characterize biological networks. Significant research 
has been done to examine various topological properties 
of different networks using computational and analyti- 
cal methods. It has been found that many biological 
networks (just like other empirical networks) may have 
power-law degree distributions [|[, are modular Q and 
hierarchical and have specific distributions of topo- 
logical features which can be used to characterize them 
[a, B El • In addition, topological properties have been 
used to predict missing edg es in networks [ll[ and via- 
bility of mutant strains [12| • 

In this rapid communication, we show that various 
topological metrics (which are the signature of complex 
network architecture) , associate with microbe character- 
istics and phenotypes to a surprisingly high degree. We 
undertake an automated approach using various topolog- 
ical metrics from complex networks to characterize a col- 
lection of various kinds of biological networks and show 
how these metrics associate strongly with microbe char- 
acteristics. Specifically, (i) using publicly available data 
we collect and cross-reference metabolic networks for 32 
different microbes via ten different quantifiable charac- 
teristics and phenotypes, (ii) we use a suite of 11 com- 
plex network metrics, so as to comprehensively compare 
all 32 networks simultaneously, allowing for a much more 
in-depth evaluation of network models [l3| than is pos- 
sible with the usually existing practice of comparing one 
or two particular properties, most commonly the degree 
distribution, (iii) we show that most of the network met- 
rics we use are independent and that multiple metrics 
are necessary to characterize the variability in networks 
meaningfully, (iv) via a hierarchical linear modeling ap- 
proach, we identify subsets of network parameters which 
associate strongly with various microbe characteristics 
and phenotypes. By presenting these strong associations 
and exhibiting the necessity of multiple metrics to do so, 
this work is a step forward toward a systemic cataloging 
of the methods and properties of biological networks that 
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are relevant to the underlying biology, and towards better 
modeling of emergent biological properties. 

The microbe characteristics or phenotypes that are ex- 
plored in this work are: (1) microbe class (MC), (2) 
genome size (GS), (3) GC content (GC), (4) modular- 
ity (Q), (5) number of such modules (Nq), (6) motil- 
ity (MO), (7) competence(CO) and whether these mi- 
crobes are (8) animal pathogens (AP), (9) strict anaer- 
obes (AN), or, (10) extremophiles (EX). 

Microbes are normally classified as archaea or bacteria 
[Tij . Genome size alludes to the sum total of DNA con- 
tained within one copy of a genome. The usual measure 
of it is in terms of mass in picograms or the total num- 
ber of nucleotide base pairs (commonly as millions of base 
pairs, or megabases). Intriguingly, an organism's genome 
size is not directly proportional to its complexity and a 
few microbes have much more DNA compared to other 
microbes. In this context, it is interesting to point out 
that the association between genome size and topologi- 
cal metrics of the networks are among the strongest of all 
phenotypes explored in this work. GC content is the per- 
centage of nitrogenous bases on a DNA molecule which is 
either cytosine or guanine (and not thymine or adenine) . 
Data for genome size and GC content was obtained from 
the NCBI Entrez genome project database [l5[. With 
regard to biological networks, modularity is defined as 
the fraction of edges within modules less the expected 
fraction of such edges. We use a recent algorithm [17[ in 
determining the community-structure in networks which 
incorporates edge directionality. Until recently, the most 
common approach to modularity in complex networks 
literature has been to simply ignore edge direction and 
apply methods developed for community discovery in 
undirected networks. However, this discards potentially 
useful information contained in edge directions, which 
is most commonly a very biologically relevant criterion. 
It should be noted that modularity is intimately con- 
nected to function in biology as the modules typically 
correspond to genetic circuits or pathways 0, [IB] • There- 
fore, we include it here as a phenotypic property rather 
than as a variable. In scenarios where modularity lacks 
apparent connection to function, it is more appropriate 
to treat Q and Nq as input variables. 

Motility allows microbes to move toward desirable en- 
vironments and away from undesirable ones. Compe- 
tence denotes the ability of a cell to take up extracel- 
lular DNA from its environment. Anaerobic organisms 
are those that do not require oxygen for growth and may 
even die in its presence. Extremophiles are organisms 
which thrive in or require extreme physical or gcochcmi- 
cal conditions, in which majority of life on Earth cannot 
survive. Data for phenotypes (6) to (10) have been com- 
piled from Ref. [lj|. While GS,GC,Q,N Q can take on 
any value, the rest of the microbe characteristics or phe- 
notypes are "binary" (e.g., a microbe is either an archaea 
or a bacteria; either aerobic or anerobic, etc.). 

We used metabolic networks of 32 different microbes 
based on data deposited in the WIT database [19]. This 
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FIG. 1: The heatmap over network metrics 



database contains metabolic pathways that were pre- 
dicted using the sequenced genomes of several organisms. 
The nodes in these networks are enzymes, substrates and 
intermediate complexes, while edges represent sequences 
of reactions in the organism's cells. (We had to exclude 
the following three microbial species: A. actinomyc. , R. 
caps, and M. thermoautot., from the original collection 
because many of the microbe characteristics or pheno- 
typic data do not seem to be publicly available for them.) 
The network sizes vary from 595 nodes and 1354 edges 
to 2982 nodes and 7300 edges. 

We calculated a suite of 11 important complex network 
attributes across all 32 networks. These are: the number 
of nodes, N, and edges in the network and the first three 
standardized moments (mean, standard deviation, and 
skewness) of the distributions of geodesic [2(|, between- 
ness coefficient [21] , and, degree, of the network; respec- 
tively denoted as geo\, geoi, geo%, betwi, betw-i, betw^, 
deg\,deg2,deg^. The importance of studying the higher 
moments of distributions is well known in physics [22j. 
The Geodesic was calculated by using the Dijkstra Al- 
gorithm [20| . For normalization, we subtract the mean 
value of a metric (over all species) and then divide by the 
standard deviation of the metric (over all species), for all 
networks. Some of our metrics are robust to measure- 
ment errors. Observing the system (i.e. network) from 
multiple angles, provides a measure of robustness against 
noise (false positives and false negatives). 

The degree of overlap, or dependence, between the at- 
tributes when characterizing networks can be accurately 
assessed by a symmetric heatmap, showing the pairwise 
correlations of the network metrics over all the networks. 
Fig. 1 shows the heatmap over network attributes. We 
start with a 32 dimensional vector (which is the number 
of microbes studied) for each of the 11 metrics. Thus, we 
have 11 points in the 32 dimensional vector space. We 
then calculate the correlation between all pairs of these 
11 points, and color-code the distance. White indicates 
perfect correlation while black indicates anti-correlation. 
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Range (min,max) 


Pbest 


{prandoin} 


p- value 


Best Model Variables 


MC 


Binary 


0.113 


0.507 


< 3 x 10" 5 


N, edges, geot, geo2 , geo3 , betwi , betw2 , betw-i , degi 


GS 


(0.58, 6.3) 


0.476 


1.302 


< 10" 6 


N, edges, betwi , betW2 , betws , deg2 , degz 


GC 


(28.2, 66.6) 


0.763 


1.158 


< 9.8 x 10~ 5 


N, edges, geo\,geo2, geo%, betw\ 


Q 


(0.59, 0.69) 


0.005 


0.033 


< 10~ 6 


N, edges, geo2, geoz, betwi, betW3, degi, deg2 


Nq 


(14, 35) 


2.102 


6.413 


< 10~ 6 


N, edges, geoi , ge02 , geoz , betwi , degi , deg2 


MO 


Binary 


0.315 


0.577 


< 1.4 x 10~ 5 


N, edges, betwz, deg\,deg2, degz 


CO 


Binary 


0.158 


0.683 


< 9 x 10~ 6 


N, edges, geoi , geo2 , geoz , betwi , betW3 , degi , deg2 , degz 


AP 


Binary 


0.325 


0.567 


< 10~ 6 


geoi, geo2,betW3,deg2, degz 


AN 


Binary 


0.359 


0.495 


< 2.66 x 10~ 4 


edges, geoi,geo3, betwi, betW2, betwz, degz 


EX 


Binary 


0.284 


0.540 


< 10" 6 


geoi , geo2 , betwz , degi ,deg2, degz 



TABLE I: Exploring the association of microbe characteristics and phenotypes with network metrics: Microbe class (MC); 
Genome size (GS); GC content (GC); Modularity (Q); Number of modules(iVQ); Motility (MO); Competence (CO); and 
whether the microbes are Animal Pathogens (AP), Strict Anaerobes (AN) or Extremophiles (EX). 



The rows (and by symmetry the columns) are arranged 
automatically so that the rows most similar are placed 
next to each other, as determined by the hierarchical 
clustering algorithm implemented in the heatmap pack- 
age of the R system 23] (as any other clustering scheme, 
this one too has its limitations, e.g. in the placement 
of the edges and nodes columns, which could arguably 
be swapped). Thus, the map allows us to identify clus- 
ters of "similar" network attributes by looking for blocks 
of light-colored squares along the diagonal of the figure. 
Since there is only a small amount of clustering along 
the diagonal, it follows that the network attributes we 
have chosen are relatively independent, and thus, they 
all provide information to our analysis. 

To find how well the organism phenotype associates 
with the underlying network architecture, we consider 
our 11 network metrics (which can be regarded as char- 
acteristics of the architecture) and model each phenotype 
as a linear combination of these metrics. It should be es- 
pecially noted that the basis of linear modeling is not 
to imply that the dependent variables are the cause and 
the explanatory variables are the effect, but that there 
is a significant association between these variables. An- 
ticipating that not all metrics will be pertinent to each 
phenotype, and, in general, to avoid over-fitting we use 
hierarchical linear regression methods to model the phe- 
notypes as linear combinations of subsets of the network 
metrics. To identify the best model we start by assum- 
ing a linear dependence on all 11 variables, because we 
do not know initially which ones associate better than 
others. We then iteratively proceed to exclude variables 
whose absence improves or does not significantly alter the 
quality of the resulting model (we used a specific imple- 
mentation of this iterative procedure through the stepQ 
function of the R system [23j). The model selection is 
guided by minimizing the well-known Akaike Informa- 
tion Criterion [2~i| denoted here as a, a standard measure 
in statistics allowing for selection among various nested 
models, a scores a model based on its goodness-of-fit to 
the data and penalizes models having many parameters. 



If k is the number of parameters in the statistical model, 
and L is the maximum log-likelihood for the estimated 
model, a is defined as: 

a = 2/c-21n(L) (1) 

Thus, we identify the smallest number of independent 
and indispensable network metrics that can be associ- 
ated with the microbe characteristic or phenotype. The 
results for the best model for each phenotype are given in 
Table 1. We use the Root Mean Square Error, p, which 
is a measure of the goodness-of-fit of our model associ- 
ations and the experimental data, p of an estimator X 
with respect to the estimated parameter X is defined as 
the square root of the mean squared error, 

p(X) = y/E((X-X)*). (2) 

We also report the significance of the best model, which 
we obtain by the linear hierarchical modeling procedure 
discussed above by bootstrapping with respect to the 
same model and using a random permutation of the ob- 
served data. We measure the p of these random mod- 
els, Prandom, and how many times (or whether at all) 
Prandom < Pbest, where p best is obviously the p of the best 
model. The number of times this happens is reflected in 
the normalized significance. We observe 10 6 such ran- 
dom permutations, for each microbe phenotype. We also 
performed an analysis of variance (ANOVA) of the differ- 
ence of our model with fewest dependent variables versus 
the model with all 11 variables, and the difference was 
not significant. 

For half of the microbe phenotypes in this study (GS, 
Q, Nq, AP and EX), we do not come across a single 
instance where the Prandom is less than p^est for that 
phenotype. For each of these five phenotypes and also 
for the rest of the ones considered in this study, pbest is 
always less than (p ra ndom}, with very low p-values. This, 
thus indicates a strong association of organism pheno- 
types with the relevant network metrics, in general. 

There are some other facts which are observable from 
Table 1: (i) there is no supremely important single metric 
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associated with each and every phenotype studied here, 
and, (ii) in the present study, this of course rules out 
one or more set of metrics associated with more than 
one phenotype(s). Albeit, the latter occurrence does not 
automatically follow from the former if one or more met- 
rics is consistently observed to be associated with all 
phenotypes. These facts, however, attest to the indis- 
pensability of the simultaneous study of multiple network 
metrics. It is notable that the association patterns are 
non-trivial, even when the microbe phenotype or char- 
acteristic is simply binary, as opposed to the case, when 
it possesses a range of values. The dependence of the 
prediction quality on the number of metrics is also not 
readily ascertainable. For example in AP, five of the 11 
metrics seem to be needed for sufficient representation, 
while eight are required for Q and Nq. However, with six 
metrics for GC and MO, or nine for MC the prediction 
quality is apparently not enhanced. 

Interestingly, the orthogonality of the geodesic and be- 
tweenness metrics which we established before is reflected 
by their consistent appearance in the association results. 
It is entirely possible that the association of other net- 
work metrics, which are not a part of this study, with 
these or other phenotypes or organism characteristics 
could be particularly strong. Exhaustive studies with the 
inclusion of such metrics should bear out this fact. Here 
we focused on metrics that have been shown to be biolog- 
ically pertinent. The approaches adopted here are scal- 
able and can easily accommodate other important met- 
rics, which could be unraveled in future as a result of the 
continuous ongoing research in network theory. 

The importance of this study is in justifying that suit- 
ably identified groups of network metrics, can and should 
be used to meaningfully model and study organism char- 
acteristics. Most immediately, the results can be used 
to build more sophisticated and even predictive models 
of organism phenotypes, based on their network archi- 



tecture. These results are also a good starting point for 
classification or cataloging of biologically relevant topo- 
logical features, that can eventually yield vocabularies 
which cross-reference topology with biological function. 
While still far away, we expect such tabulated and well- 
described architectural features to be akin to biological 
markers in other empirical data. In this sense, our work 
is a modest step toward understanding the precise na- 
ture of interdependence between function and topology 
in biological networks. Followup modeling and simula- 
tions could give valuable insight into a wide range of far- 
reaching issues like the effect of topology on the design 
and evolution of networks. The comprehensive "look-up 
scheme" , elucidated with the present set of biological net- 
works, could also be helpful for other real- world complex 
networks in general. Of course, the measures need not be 
the same as those above and will depend on the nature 
and topology of the network. 

It is well known that various centrality measures play 
an important role in networks and in some cases (e.g., 
in the global airline network (25|), few nodes which have 
relatively low degree but high betweenness could be very 
special. Nodes with high betweenness can act as bot- 
tlenecks for information passage and the role of between- 
ness is well known in epidemiology, information and wire- 
less or sensor networks. The role of betweenness in bio- 
logical networks is being thoroughly exploited in recent 
times 26]. However, to our knowledge, there is almost 
no in-depth work in literature, investigating the role of 
higher moments of the betweenness distribution in bio- 
logical networks. The present work underlines the im- 
portance of such studies. 

We are grateful to Z. Saul for help in data gathering 
for some of our networks. We thank E. Leicht and M. 
Newman for providing us the codes on modularity struc- 
ture in directed networks. This work was funded in part 
by the NSF under Grant No. IIS-0613949. 
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